Toronto AI Lab
LATTE3D
LATTE3D: Large-scale Amortized Text-To-Enhanced3D Synthesis

Kevin Xie *,1,2,3
Jonathan Lorraine *,1,2,3
Tianshi Cao *,1,2,3
Jun Gao 1,2,3
James Lucas 1
Antonio Torralba 1,4
Sanja Fidler 1,2,3
Xiaohui Zeng 1,2,3

1NVIDIA
2University of Toronto
3Vector Institute
4MIT
*Authors contributed equally

European Conference on Computer Vision (ECCV) 2024


LATTE3D generates high-quality textured meshes from text robustly in just 400ms by combining 3D priors, amortized optimization, and a second stage of surface rendering.

Abstract: Recent text-to-3D generation approaches produce impressive 3D results but require time-consuming optimization that can take up to an hour per prompt. Amortized methods like ATT3D optimize multiple prompts simultaneously to improve efficiency, enabling fast text-to-3D synthesis. However, they cannot capture high-frequency geometry and texture details and struggle to scale to large prompt sets, so they generalize poorly. We introduce LATTE3D, addressing these limitations to achieve fast, high-quality generation on a significantly larger prompt set. Key to our method is 1) building a scalable architecture and 2) leveraging 3D data during optimization through 3D-aware diffusion priors, shape regularization, and model initialization to achieve robustness to diverse and complex training prompts. LATTE3D amortizes both neural field and textured surface generation to produce highly detailed textured meshes in a single forward pass. LATTE3D generates 3D objects in 400ms, and can be further enhanced with fast test-time optimization.




Paper

Kevin Xie, Jonathan Lorraine, Tianshi Cao, Jun Gao,
James Lucas, Antonio Torralba, Sanja Fidler, Xiaohui Zeng

LATTE3D: Large-scale Amortized Text-To-Enhanced3D Synthesis



[arXiv]
 
[30s Demo]
 
[30s Summary]
 
[3min Demo]
 
[3min Overview]
 
[Overview Slides]
 
[Bibtex]



Overview


Generating 3D objects has seen recent success on multiple fronts: quality (e.g., via surface rendering as in Magic3D), prompt-robustness (e.g., via 3D priors as in MVDream), and real-time generation (e.g., via amortized optimization as in ATT3D). We combine these benefits into a text-to-3D pipeline, allowing real-time generation of high-quality assets for a wide range of text prompts.


Use case: We generate high-quality 3D assets in only 400ms for a wide range of text prompts, with an option to regularize towards a user-specified 3D shape.





Our Method


LATTE3D consists of two stages: First, we use volumetric rendering to train both the texture and the geometry. To improve robustness to diverse prompts, the training objective includes an SDS gradient from a 3D-aware image prior and a regularization loss comparing the rendered mask of the predicted shape against the masks of 3D assets in a library. Second, we use surface-based rendering and train only the texture to enhance quality. Both stages use amortized optimization over a set of prompts to maintain fast generation. A minimal sketch of this training loop is given below.
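The code below is a minimal sketch of the two-stage amortized training described above, not our released implementation: the helpers (render_volumetric, render_surface, sds_loss, mask_regularizer, random_camera) and the shape-library lookup are illustrative placeholders.

import random
import torch

# Sketch only: stage 1 trains geometry and texture with volumetric rendering,
# a 3D-aware SDS objective, and a mask regularizer; stage 2 refines texture
# with surface rendering. All helper functions are hypothetical placeholders.

def train_stage1(model, prompts, shape_library, prior, steps, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        prompt = random.choice(prompts)                    # amortize over the prompt set
        ref = shape_library[prompt]                        # paired 3D asset (placeholder lookup)
        geometry, texture = model(prompt, ref.points)      # single forward pass
        rgb, mask = render_volumetric(geometry, texture, random_camera())
        loss = sds_loss(prior, rgb, prompt)                # SDS gradient from 3D-aware prior
        loss = loss + mask_regularizer(mask, ref.mask)     # shape regularization on masks
        opt.zero_grad(); loss.backward(); opt.step()

def train_stage2(model, prompts, prior, steps, lr=1e-4):
    opt = torch.optim.Adam(model.texture_parameters(), lr=lr)  # geometry kept frozen
    for _ in range(steps):
        prompt = random.choice(prompts)
        geometry, texture = model(prompt)
        rgb = render_surface(geometry, texture, random_camera())
        loss = sds_loss(prior, rgb, prompt)
        opt.zero_grad(); loss.backward(); opt.step()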




Our method uses two networks: a texture network T and a geometry network G, each built from a combination of triplanes and U-Nets. In the first stage, the encoders of both networks share the same set of weights. In the second stage, we freeze the geometry network G and update only the texture network T, further upsampling its triplanes with an MLP that takes the text embedding as input.
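Schematically, this layout can be pictured as in the sketch below. The module names (TriplaneUNet, TextureUpsamplerMLP), channel counts, and the exact way the stage-2 upsampler consumes the text embedding are assumptions for exposition, not the released architecture.

import torch
import torch.nn as nn

# Schematic sketch only: TriplaneUNet and TextureUpsamplerMLP stand in for the
# triplane + U-Net components; heads and shapes are illustrative.

class Latte3DSketch(nn.Module):
    def __init__(self, text_dim=512, plane_ch=32):
        super().__init__()
        self.shared_encoder = TriplaneUNet(text_dim, plane_ch)            # shared by T and G in stage 1
        self.geometry_head = nn.Conv2d(plane_ch, plane_ch, 1)             # G: geometry triplane features
        self.texture_head = nn.Conv2d(plane_ch, plane_ch, 1)              # T: texture triplane features
        self.texture_upsampler = TextureUpsamplerMLP(text_dim, plane_ch)  # used in stage 2 only

    def forward(self, text_embedding, points=None, stage=1):
        planes = self.shared_encoder(text_embedding, points)              # triplane feature maps
        geometry_planes = self.geometry_head(planes)
        texture_planes = self.texture_head(planes)
        if stage == 2:
            geometry_planes = geometry_planes.detach()                    # geometry network frozen
            texture_planes = self.texture_upsampler(texture_planes, text_embedding)
        return geometry_planes, texture_planes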




LATTE3D Usage Demo


Our trained models let users provide various text prompts and interactively view high-quality 3D assets. We improve the user experience by enhancing the 3D assets' (a) quality, (b) generation speed, and (c) diversity of supported prompts.





Result Visualizations


To test generalization to arbitrary prompts, we train our model on a larger set of roughly 100k prompts (gpt-101k), constructed by using ChatGPT to augment the captions of the LVIS subset of Objaverse. We generalize to unseen, in-distribution augmented captions, and further to unseen, out-of-distribution prompts from DreamFusion.
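The snippet below is a hypothetical sketch of this prompt-set construction step: ask_llm stands in for whichever chat API is used, and the instruction wording is illustrative only.

# Hypothetical sketch: expand Objaverse LVIS captions into a much larger prompt set
# with an LLM. `ask_llm` is a placeholder for a chat API, not a real function.

def augment_captions(lvis_captions, variants_per_caption=4):
    prompts = []
    for caption in lvis_captions:
        reply = ask_llm(
            f"Rewrite the following object caption as {variants_per_caption} "
            f"varied text-to-3D prompts, one per line: {caption}")
        prompts.extend(line.strip() for line in reply.splitlines() if line.strip())
    return prompts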


Generalization on unseen, in-distribution (gpt-101k) prompts

Generalization on unseen, out-of-distribution (DreamFusion) prompts

Stylization

We investigate our method for user-guided stylization, enabled by our optional point cloud input. We train on a large set of prompts for realistic animals, where each user-supplied point cloud can be stylized according to varying text prompts.



We expand the realistic animal stylization by training our method on animals composed with different styles. Our training prompts follow the template "object A in style B is doing C." We hold out combinations of objects, styles, and activities, and observe generalization similar to ATT3D's, with higher quality thanks to our stage-2 surface rendering. A minimal sketch of this combinatorial prompt construction follows.
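The example below sketches how such combinatorial prompts and a held-out split could be built; the word lists and the 10% hold-out fraction are illustrative, not the actual prompt set.

import itertools
import random

# Illustrative word lists only; the real prompt set and held-out split differ.
objects = ["a squirrel", "a fox", "a corgi"]
styles = ["made of lego", "wearing a wizard hat", "carved out of wood"]
activities = ["riding a skateboard", "playing the guitar", "reading a book"]

combos = list(itertools.product(objects, styles, activities))
random.seed(0)
random.shuffle(combos)
held_out = set(combos[: len(combos) // 10])    # unseen combinatorial prompts

train_prompts = [f"{o} {s} is {c}" for o, s, c in combos if (o, s, c) not in held_out]
test_prompts = [f"{o} {s} is {c}" for o, s, c in held_out]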


Stylization: Seen, training prompts for stylized animals

Stylization: Generalization on unseen, combinatorial prompts for stylized animals




Comparisons with other Text-to-3D Methods


We compare against MVDream, 3DTopia, LGM, and ATT3D, showing MVDream after 6 minutes, 30 minutes, and 1 hour of optimization. For ATT3D and LATTE3D, we show inference on unseen prompts, which takes 400ms. We plot the inference time and user-study preference rate of each method relative to LATTE3D, where a rate below 50 indicates that users prefer LATTE3D on average. Methods in the upper left of the figure are preferable.


We provide a list of the DreamFusion prompts here: df67 prompts. Below, we include examples of the results used in the user study.


Results on (unseen) DreamFusion prompts



Benefit: Quickly Assemble Scenes


Users can quickly design entire scenes with one of our models, rapidly iterating on an individual object's design or on the collection of objects in the scene. We generate each prompt's result at interactive rates, with up to 4 samples per prompt on an A6000 GPU.





Benefit: Enhanced Quality with Test-time Optimization


We support an optional, fast test-time optimization when a user desires a further quality boost on any prompt.
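Below is a hedged sketch of this optional refinement: keep optimizing the amortized model on a single prompt for a short budget. Which parameters are updated, and the helpers (render_surface, sds_loss, random_camera), are assumptions for exposition rather than our exact procedure.

import torch

# Sketch of per-prompt test-time optimization; helper functions are placeholders.

def test_time_optimize(model, prompt, prior, steps=500, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        geometry, texture = model(prompt)
        rgb = render_surface(geometry, texture, random_camera())
        loss = sds_loss(prior, rgb, prompt)   # same SDS objective, single prompt
        opt.zero_grad(); loss.backward(); opt.step()
    return model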


Generation time: 400ms (no test-time opt.) vs. 5min (with test-time opt.)
Example prompts: "Latte with intricate leaf design latte art in an elegant porcelain cup", "a cafe latte in a large glass"



Benefit: Enhanced User-controllability via Interpolations


We allow users to guide generation towards a user-provided shape (given as a point cloud), a 3D analog of image conditioning in text-to-image generation. To do this, we also amortize optimization over a point cloud regularization weight, which a user can adjust cheaply at inference time. When the weight is high, we recover the point cloud's shape; when it is low, the text prompt primarily guides the generation.
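The snippet below illustrates the kind of inference-time sweep this enables. It is a sketch under assumed names: load_point_cloud, model.generate with its points and reg_weight arguments, and the returned mesh object are all hypothetical.

import numpy as np

# Illustrative sweep of the point-cloud regularization weight at inference time.
# `load_point_cloud`, `model.generate`, and the returned mesh are hypothetical.

user_points = load_point_cloud("snail.ply")    # user-provided shape
prompt = "DSLR photo of a Domestic Cat"

for alpha in np.linspace(0.0, 1.0, num=5):
    # alpha near 1: output stays close to the point cloud's shape;
    # alpha near 0: the text prompt dominates the generation.
    mesh = model.generate(prompt, points=user_points, reg_weight=alpha)
    mesh.export(f"cat_reg_{alpha:.2f}.obj")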


User-provided shape

User-provided text prompt

"DSLR photo of Garden snail"
"DSLR photo of a Domestic Cat"
"DSLR photo of Eastern Grey Kangaroo"


Interpolating between a user-provided shape and text prompt



Downstream Application: Text-to-4D


We use LATTE3D to initialize text-to-4D methods, such as Align Your Gaussians.


Text prompt for object / text prompt for animation:
"DSLR photo of Rottweiler" / "A dog running fast"
"DSLR photo of Garden snail" / "A garden snail crawling"
"DSLR photo of Eastern Grey Kangaroo" / "A kangaroo jumping"


Additional Comparison Results

GPT-generated unseen gpt-101k prompts
Seen gpt-101k prompts

LATTE3D stylization on unseen, combinatorial animal-style prompts

MVDream (6min, 36min, 60min)
~ATT3D (<0.4s)
LATTE3D (0.4s)



Citation


Xie, K., Lorraine, J., Cao, T., Gao, J., Lucas, J., Torralba, A., Fidler, S., & Zeng, X. (2024). LATTE3D: Large-scale Amortized Text-To-Enhanced3D Synthesis. The 18th European Conference on Computer Vision (ECCV).


@article{xie2024latte3d,
  title = {LATTE3D: Large-scale Amortized Text-To-Enhanced3D Synthesis},
  author = {Kevin Xie and Jonathan Lorraine and Tianshi Cao and Jun Gao and James Lucas and Antonio Torralba and Sanja Fidler and Xiaohui Zeng},
  journal = {The 18th European Conference on Computer Vision (ECCV)},
  year = {2024},
}


We thank David Acuna for the website template.